Interactive Data Visualization with Bokeh

Bokeh is an interactive Python library for visualizations that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.

  • To get started using Bokeh to make your visualizations, see the User Guide.
  • To see examples of how you might use Bokeh with your own data, check out the Gallery.
  • A complete API reference for Bokeh is available in the Reference Guide.

The following notebook is intended to illustrate some of Bokeh's interactive utilities and is based on a post by software engineer and Bokeh developer Sarah Bird.

Recreating Gapminder's "The Health and Wealth of Nations"

Gapminder started as a spin-off from Professor Hans Rosling’s teaching at the Karolinska Institute in Stockholm. Having encountered broad ignorance about the rapid health improvement in Asia, he wanted to measure that lack of awareness among students and professors. He presented the surprising results from his so-called “Chimpanzee Test” in his first TED-talk in 2006.

Rosling's interactive "Health and Wealth of Nations" visualization has since become an iconic illustration of how our assumptions about ‘first world’ and ‘third world’ countries can betray us. Mike Bostock has recreated the visualization using D3.js, and in this lab, we will see that it is also possible to use Bokeh to recreate the interactive visualization in Python.

About Bokeh Widgets

Widgets are interactive controls that can be added to Bokeh applications to provide a front end user interface to a visualization. They can drive new computations, update plots, and connect to other programmatic functionality. When used with the Bokeh server, widgets can run arbitrary Python code, enabling complex applications. Widgets can also be used without the Bokeh server in standalone HTML documents through the browser’s JavaScript runtime.

To use widgets, you must add them to your document and define their functionality. Widgets can be added directly to the document root or nested inside a layout. There are two ways to program a widget’s functionality:

  • Use the CustomJS callback (see CustomJS for Widgets). This will work in standalone HTML documents.
  • Use bokeh serve to start the Bokeh server and set up event handlers with .on_change (or for some widgets, .on_click).
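As a quick taste of the second approach, here is a minimal sketch of a Bokeh server app; the file name app.py and the printed message are illustrative assumptions, and the script would be run with bokeh serve app.py rather than in this notebook:

# app.py -- minimal Bokeh server sketch (illustrative only)
from bokeh.io import curdoc
from bokeh.models import Slider

slider = Slider(start=0, end=10, value=5, step=1, title="Value")

def update(attr, old, new):
    # The Bokeh server runs this Python handler, so it can do anything:
    # query a database, recompute a model, update other glyphs, etc.
    print("Slider moved from {} to {}".format(old, new))

slider.on_change('value', update)
curdoc().add_root(slider)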

Imports


In [15]:
# Science Stack 
import numpy as np
import pandas as pd

# Bokeh essentials: inline output, figures, and display
from bokeh.io import show
from bokeh.io import output_notebook
from bokeh.plotting import figure

# Layouts 
from bokeh.layouts import layout
from bokeh.layouts import widgetbox

# Data models for visualization 
from bokeh.models import Text
from bokeh.models import Plot
from bokeh.models import Slider
from bokeh.models import Circle
from bokeh.models import Range1d
from bokeh.models import CustomJS
from bokeh.models import HoverTool
from bokeh.models import LinearAxis
from bokeh.models import ColumnDataSource
from bokeh.models import SingleIntervalTicker

# Palettes and colors
from bokeh.palettes import brewer
from bokeh.palettes import Spectral6

To display Bokeh plots inline in a Jupyter notebook, use the output_notebook() function from bokeh.io. When show() is called, the plot will be displayed inline in the next notebook output cell. To save your Bokeh plots, you can use the output_file() function instead (or in addition). The output_file() function will write an HTML file to disk that can be opened in a browser.
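For example, a minimal sketch of writing the plots to disk as well (the file name here is just an illustration):

# Subsequent calls to show() will also write a standalone HTML file.
from bokeh.io import output_file
output_file("gapminder.html", title="The Health and Wealth of Nations")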


In [2]:
# Load Bokeh for visualization
output_notebook()


Loading BokehJS ...

Get the data

Some of Bokeh's examples rely on sample data that is not included in the Bokeh GitHub repository or released packages, due to its size. Once Bokeh is installed, the sample data can be obtained by executing the command in the next cell. The location where the sample data is stored can be configured. By default, data is downloaded and stored in the $HOME/.bokeh/data directory. (The directory is created if it does not already exist.)


In [3]:
import bokeh.sampledata
bokeh.sampledata.download()


Creating /Users/benjamin/.bokeh directory
Creating /Users/benjamin/.bokeh/data directory
Using data directory: /Users/benjamin/.bokeh/data
Downloading: CGM.csv (1589982 bytes)
   1589982 [100.00%]
Downloading: US_Counties.zip (3182088 bytes)
   3182088 [100.00%]
Unpacking: US_Counties.csv
Downloading: us_cities.json (713565 bytes)
    713565 [100.00%]
Downloading: unemployment09.csv (253301 bytes)
    253301 [100.00%]
Downloading: AAPL.csv (166698 bytes)
    166698 [100.00%]
Downloading: FB.csv (9706 bytes)
      9706 [100.00%]
Downloading: GOOG.csv (113894 bytes)
    113894 [100.00%]
Downloading: IBM.csv (165625 bytes)
    165625 [100.00%]
Downloading: MSFT.csv (161614 bytes)
    161614 [100.00%]
Downloading: WPP2012_SA_DB03_POPULATION_QUINQUENNIAL.zip (5148539 bytes)
   5148539 [100.00%]
Unpacking: WPP2012_SA_DB03_POPULATION_QUINQUENNIAL.csv
Downloading: gapminder_fertility.csv (64346 bytes)
     64346 [100.00%]
Downloading: gapminder_population.csv (94509 bytes)
     94509 [100.00%]
Downloading: gapminder_life_expectancy.csv (73243 bytes)
     73243 [100.00%]
Downloading: gapminder_regions.csv (7781 bytes)
      7781 [100.00%]
Downloading: world_cities.zip (646858 bytes)
    646858 [100.00%]
Unpacking: world_cities.csv
Downloading: airports.json (6373 bytes)
      6373 [100.00%]
Downloading: movies.db.zip (5067833 bytes)
   5067833 [100.00%]
Unpacking: movies.db

Prepare the data

In order to create an interactive plot in Bokeh, we need to animate snapshots of the data over time from 1964 to 2013. To do this, we can think of each year as a separate static plot. We can then use a JavaScript callback to change the data source that is driving the plot.

JavaScript Callbacks

Bokeh exposes various callbacks, which can be specified from Python, that trigger actions inside the browser’s JavaScript runtime. This kind of JavaScript callback can be used to add interesting interactions to Bokeh documents without the need to use a Bokeh server (but can also be used in conjunction with a Bokeh server). Custom callbacks can be set using a CustomJS object and passing it as the callback argument to a Widget object.

As the data we will be using today is not too big, we can pass all the datasets to the JavaScript at once and switch between them on the client side using a slider widget.

This means that we need to put all of the datasets together and build a single data source for each year. First we will load each of the datasets with the process_data() function and do a bit of clean up:


In [5]:
def process_data():
    
    # Import the gap minder data sets
    from bokeh.sampledata.gapminder import fertility, life_expectancy, population, regions
    
    # The columns are currently string values for each year, 
    # make them ints for data processing and visualization.
    columns = list(fertility.columns)
    years = list(range(int(columns[0]), int(columns[-1])))
    rename_dict = dict(zip(columns, years))
    
    # Apply the integer year column names to the data sets. 
    fertility = fertility.rename(columns=rename_dict)
    life_expectancy = life_expectancy.rename(columns=rename_dict)
    population = population.rename(columns=rename_dict)
    regions = regions.rename(columns=rename_dict)

    # Turn population into bubble sizes. Use min_size and factor to tweak.
    scale_factor = 200
    population_size = np.sqrt(population / np.pi) / scale_factor
    min_size = 3
    population_size = population_size.where(population_size >= min_size).fillna(min_size)

    # Use pandas categories and categorize & color the regions
    regions.Group = regions.Group.astype('category')
    regions_list = list(regions.Group.cat.categories)

    def get_color(r):
        return Spectral6[regions_list.index(r.Group)]
    regions['region_color'] = regions.apply(get_color, axis=1)

    return fertility, life_expectancy, population_size, regions, years, regions_list

Next we will add each of our sources to the sources dictionary, where each key is the name of the year (prefaced with an underscore) and each value is a dataframe with the aggregated values for that year.

Note that the prefix is needed because JavaScript identifiers cannot begin with a number.


In [6]:
# Process the data and fetch the data frames and lists 
fertility_df, life_expectancy_df, population_df_size, regions_df, years, regions = process_data()

# Create a data source dictionary whose keys are prefixed years
# and whose values are ColumnDataSource objects that merge the 
# various per-year values from each data frame. 
sources = {}

# Quick helper variables 
region_color = regions_df['region_color']
region_color.name = 'region_color'

# Create a source for each year. 
for year in years:
    # Extract the fertility for each country for this year.
    fertility = fertility_df[year]
    fertility.name = 'fertility'
    
    # Extract life expectancy for each country for this year. 
    life = life_expectancy_df[year]
    life.name = 'life' 
    
    # Extract the normalized population size for each country for this year. 
    population = population_df_size[year]
    population.name = 'population' 
    
    # Create a dataframe from our extraction and add to our sources 
    new_df = pd.concat([fertility, life, population, region_color], axis=1)
    sources['_' + str(year)] = ColumnDataSource(new_df)

You can see what's in the sources dictionary by running the cell below.

Later we will be able to pass this sources dictionary to the JavaScript Callback. In so doing, we will find that in our JavaScript we have objects named by year that refer to a corresponding ColumnDataSource.


In [7]:
sources


Out[7]:
{'_1964': ColumnDataSource(id='128797c9-c1ff-40ef-8cca-924aa30eaa9e', ...),
 '_1965': ColumnDataSource(id='fcaf9623-f5ab-4366-ba57-1dc23b920003', ...),
 '_1966': ColumnDataSource(id='dbafbed0-9bac-4de2-8032-4536df3773a5', ...),
 '_1967': ColumnDataSource(id='638521cd-3270-450b-aa32-b538846f2a07', ...),
 '_1968': ColumnDataSource(id='2339feb4-91cb-4c36-9564-88a973a1cccb', ...),
 '_1969': ColumnDataSource(id='4b6a7ad2-99b3-44fa-b50a-2adcf18b80a8', ...),
 '_1970': ColumnDataSource(id='c7f48b7f-9985-4035-a73b-b144827ab913', ...),
 '_1971': ColumnDataSource(id='4704bc4d-1f2c-4fbe-b11e-b613017e557d', ...),
 '_1972': ColumnDataSource(id='8e4ac357-c1e5-4078-b17e-ff2668c2c2ba', ...),
 '_1973': ColumnDataSource(id='432483b6-c175-453a-8f7c-0495bdea8797', ...),
 '_1974': ColumnDataSource(id='59d7244c-f030-4347-aa76-b73a1c166359', ...),
 '_1975': ColumnDataSource(id='7eb64100-122d-4806-a464-90291155f3c8', ...),
 '_1976': ColumnDataSource(id='3e0e0d94-8a5b-40e8-a097-eea8df436477', ...),
 '_1977': ColumnDataSource(id='819fdb97-2d8a-49c5-9379-02188d9b7e35', ...),
 '_1978': ColumnDataSource(id='f07b28e1-4d00-4d8e-a76f-f50a526c791d', ...),
 '_1979': ColumnDataSource(id='f9ba4f95-2257-4015-ba92-9158ec431fd6', ...),
 '_1980': ColumnDataSource(id='c91bbe51-f2f9-40e4-8a92-14d7bc32963a', ...),
 '_1981': ColumnDataSource(id='798d83cb-2b3b-47c6-8d98-a361032537f8', ...),
 '_1982': ColumnDataSource(id='c0e69480-474e-485e-b334-65ec1797d277', ...),
 '_1983': ColumnDataSource(id='ef414232-e7c7-4d2f-a532-ce3f89ec85c9', ...),
 '_1984': ColumnDataSource(id='d730c8eb-d0b5-4ef3-84bb-be25a579f996', ...),
 '_1985': ColumnDataSource(id='a49d0414-8d41-46ce-b7de-2b6a034e7925', ...),
 '_1986': ColumnDataSource(id='44fb667e-a0b1-412a-b5d7-aad9a67b1138', ...),
 '_1987': ColumnDataSource(id='f405735d-0d2e-48b8-9e61-1cae920f8565', ...),
 '_1988': ColumnDataSource(id='1821bf62-fe61-4006-be2a-9ac91c089cd7', ...),
 '_1989': ColumnDataSource(id='b131dc0d-f790-47e5-9e96-559120d409c7', ...),
 '_1990': ColumnDataSource(id='bfe113ac-abd4-4fbd-bb59-f1e7805a278f', ...),
 '_1991': ColumnDataSource(id='9cbe0f63-cfd8-4831-9553-0e1816413a6c', ...),
 '_1992': ColumnDataSource(id='01086edd-7aff-4b4a-bd9b-b6c6bee653a1', ...),
 '_1993': ColumnDataSource(id='08ee3746-9b18-4971-817d-fc39d2f00ac7', ...),
 '_1994': ColumnDataSource(id='9a38ce2e-abd0-490d-9bd7-e3290afaa45d', ...),
 '_1995': ColumnDataSource(id='d7566b0a-c44b-44df-a6f1-338af7427256', ...),
 '_1996': ColumnDataSource(id='49f6af2d-d294-45e1-9818-2a289501531b', ...),
 '_1997': ColumnDataSource(id='40537e00-289f-4963-aceb-9db53f086319', ...),
 '_1998': ColumnDataSource(id='29900086-9343-4e72-b970-3f6998f00fb7', ...),
 '_1999': ColumnDataSource(id='8466426f-f5c3-45a9-be15-7d4afc5bc8c3', ...),
 '_2000': ColumnDataSource(id='c68b2c66-a620-4330-a28c-06bf22966e59', ...),
 '_2001': ColumnDataSource(id='c1ddaaf2-0d47-47a6-be89-8451320b07a3', ...),
 '_2002': ColumnDataSource(id='b22a9525-f422-462d-98b3-e3257b9f1361', ...),
 '_2003': ColumnDataSource(id='f6c2e16b-6881-497a-a441-95c75cc0085d', ...),
 '_2004': ColumnDataSource(id='a3f3a02f-3b8a-4ca4-8818-a221fd85808d', ...),
 '_2005': ColumnDataSource(id='2988e9ca-5daa-4f58-afe9-a5b8533a7b72', ...),
 '_2006': ColumnDataSource(id='d0c61b85-a640-43a6-ae00-3fdb912c2bef', ...),
 '_2007': ColumnDataSource(id='1fdb0348-6cee-4983-a954-1b15514954ae', ...),
 '_2008': ColumnDataSource(id='74df0458-c72f-4b82-a328-88eb34a186d0', ...),
 '_2009': ColumnDataSource(id='f102827a-cb23-489b-a2ec-1a44509e9d77', ...),
 '_2010': ColumnDataSource(id='f3f99dde-4221-4158-b9f9-329a12c83a2a', ...),
 '_2011': ColumnDataSource(id='872d54a0-ab97-41cb-bbda-ea0df8e6b485', ...),
 '_2012': ColumnDataSource(id='a0fe0918-569c-446f-9029-7df5ee7ddf8a', ...)}

We can also create a corresponding dictionary_of_sources object, where the keys are integers and the values are the references to our ColumnDataSources from above:


In [8]:
dictionary_of_sources = dict(zip(years, ['_%s' % x for x in years]))

In [14]:
js_source_array = str(dictionary_of_sources).replace("'", "")
js_source_array


Out[14]:
'{1964: _1964, 1965: _1965, 1966: _1966, 1967: _1967, 1968: _1968, 1969: _1969, 1970: _1970, 1971: _1971, 1972: _1972, 1973: _1973, 1974: _1974, 1975: _1975, 1976: _1976, 1977: _1977, 1978: _1978, 1979: _1979, 1980: _1980, 1981: _1981, 1982: _1982, 1983: _1983, 1984: _1984, 1985: _1985, 1986: _1986, 1987: _1987, 1988: _1988, 1989: _1989, 1990: _1990, 1991: _1991, 1992: _1992, 1993: _1993, 1994: _1994, 1995: _1995, 1996: _1996, 1997: _1997, 1998: _1998, 1999: _1999, 2000: _2000, 2001: _2001, 2002: _2002, 2003: _2003, 2004: _2004, 2005: _2005, 2006: _2006, 2007: _2007, 2008: _2008, 2009: _2009, 2010: _2010, 2011: _2011, 2012: _2012}'

Now we have an object that's storing all of our ColumnDataSources, so that we can look them up.
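As a quick sanity check, you can peek at any one of these sources; the column names should match the dataframe we concatenated above (the country index plus fertility, life, population, and region_color):

# Inspect the columns carried by the 1964 source
sources['_1964'].data.keys()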

Build the plot

First we need to create a Plot object. We'll start with a basic frame, only specifying things like plot height, width, and ranges for the axes.


In [16]:
xdr = Range1d(1, 9)
ydr = Range1d(20, 100)

plot = Plot(
    x_range=xdr,
    y_range=ydr,
    plot_width=800,
    plot_height=400,
    outline_line_color=None,
    toolbar_location=None, 
    min_border=20,
)

In order to display the plot in the notebook, use the show() function:


In [19]:
# show(plot)

Build the axes

Next we can make some stylistic modifications to the plot axes (e.g. by specifying the text font, size, and color, and by adding labels), to make the plot look more like the one in Hans Rosling's video.


In [20]:
# Create a dictionary of our common settings. 
AXIS_FORMATS = dict(
    minor_tick_in=None,
    minor_tick_out=None,
    major_tick_in=None,
    major_label_text_font_size="10pt",
    major_label_text_font_style="normal",
    axis_label_text_font_size="10pt",

    axis_line_color='#AAAAAA',
    major_tick_line_color='#AAAAAA',
    major_label_text_color='#666666',

    major_tick_line_cap="round",
    axis_line_cap="round",
    axis_line_width=1,
    major_tick_line_width=1,
)


# Create two axis models for the x and y axes. 
xaxis = LinearAxis(
    ticker=SingleIntervalTicker(interval=1), 
    axis_label="Children per woman (total fertility)", 
    **AXIS_FORMATS
)

yaxis = LinearAxis(
    ticker=SingleIntervalTicker(interval=20), 
    axis_label="Life expectancy at birth (years)", 
    **AXIS_FORMATS
)   

# Add the axes to the plot in the specified positions.
plot.add_layout(xaxis, 'below')
plot.add_layout(yaxis, 'left')

Go ahead and experiment with visualizing each step of the building process and changing various settings.


In [22]:
# show(plot)

Add the background year text

One of the features of Rosling's animation is that the year appears as text in the background of the plot. We will add this feature to our plot first so it will be layered below all the other glyphs (which will be incrementally added, layer by layer, on top of each other until we are finished).


In [23]:
# Create a data source for each of our years to display. 
text_source = ColumnDataSource({'year': ['%s' % years[0]]})

# Create a text object model and add to the figure. 
text = Text(x=2, y=35, text='year', text_font_size='150pt', text_color='#EEEEEE')
plot.add_glyph(text_source, text)


Out[23]:
GlyphRenderer(
id = '92046da9-54f9-4a43-9ad4-76e4c959a48e', …)

In [25]:
# show(plot)

Add the bubbles and hover

Next we will add the bubbles using Bokeh's Circle glyph. We start from the first year of data, which is our source that drives the circles (the other sources will be used later).


In [26]:
# Select the source for the first year we have. 
renderer_source = sources['_%s' % years[0]]

# Create a circle glyph to generate points for the scatter plot. 
circle_glyph = Circle(
    x='fertility', y='life', size='population',
    fill_color='region_color', fill_alpha=0.8, 
    line_color='#7c7e71', line_width=0.5, line_alpha=0.5
)

# Connect the glyph generator to the data source and add to the plot
circle_renderer = plot.add_glyph(renderer_source, circle_glyph)

In the above, plot.add_glyph returns the renderer, which we can then pass to the HoverTool so that hover only happens for the bubbles on the page and not other glyph elements:


In [27]:
# Add the hover (only against the circle and not other plot elements)
tooltips = "@index"
plot.add_tools(HoverTool(tooltips=tooltips, renderers=[circle_renderer]))

Test out different parameters for the Circle glyph and see how it changes the plot:


In [29]:
# show(plot)
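For example, because Bokeh glyphs are plain model objects, one way to experiment (with purely illustrative values) is to tweak the glyph's properties in place and then re-run show(plot):

# Make the bubbles more transparent and drop their outlines (illustrative)
circle_glyph.fill_alpha = 0.4
circle_glyph.line_color = None
# show(plot)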

Add the legend

Next we will manually build a legend for our plot by adding circles and text to the upper right-hand portion:


In [31]:
# Position of the legend 
text_x = 7
text_y = 95

# For each region, add a circle with the color and text. 
for i, region in enumerate(regions):
    plot.add_glyph(Text(x=text_x, y=text_y, text=[region], text_font_size='10pt', text_color='#666666'))
    plot.add_glyph(
        Circle(x=text_x - 0.1, y=text_y + 2, fill_color=Spectral6[i], size=10, line_color=None, fill_alpha=0.8)
    )
    
    # Move the y coordinate down a bit.
    text_y = text_y - 5

In [33]:
# show(plot)

Add the slider and callback

Next we add the slider widget and the JavaScript callback code, which changes the data of the renderer_source (powering the bubbles/circles) and the data of the text_source (powering our background text). After we've set() the data we need to trigger() a change. The slider, renderer_source, and text_source objects are all available in the callback because we add them as args to the CustomJS object.

It is the combination of "sources = %s" % js_source_array in the JavaScript code string and CustomJS(args=sources, ...) that provides the ability to look up, by year, the JavaScript version of our Python-made ColumnDataSources.


In [34]:
# Add the slider
code = """
    var year = slider.get('value'),
        sources = %s,
        new_source_data = sources[year].get('data');
    renderer_source.set('data', new_source_data);
    text_source.set('data', {'year': [String(year)]});
""" % js_source_array

callback = CustomJS(args=sources, code=code)
slider = Slider(start=years[0], end=years[-1], value=years[0], step=1, title="Year", callback=callback)
callback.args["renderer_source"] = renderer_source
callback.args["slider"] = slider
callback.args["text_source"] = text_source

In [37]:
# show(widgetbox(slider))

Putting all the pieces together

Last but not least, we put the chart and the slider together in a layout and display it inline in the notebook.


In [38]:
show(layout([[plot], [slider]], sizing_mode='scale_width'))


I hope that you'll use Bokeh to produce interactive visualizations for your own visual analysis.

Topic Model Visualization

In this section we'll take a look at visualizing a corpus by exploring clustering and dimensionality reduction techniques. Text analysis is certainly a high-dimensional problem, and these techniques can be applied to other data sets as well.

The first step is to load our documents from disk and vectorize them using Gensim. This content is a bit beyond the scope of today's workshop; however, I did want to provide the code for reference, and I'm happy to go over it offline.


In [3]:
import nltk 
import string
import pickle
import gensim
import random 

from operator import itemgetter
from collections import defaultdict 
from nltk.corpus import wordnet as wn
from gensim.matutils import sparse2full
from nltk.corpus.reader.api import CorpusReader
from nltk.corpus.reader.api import CategorizedCorpusReader

CORPUS_PATH = "data/baleen_sample"
PKL_PATTERN = r'(?!\.)[a-z_\s]+/[a-f0-9]+\.pickle'
CAT_PATTERN = r'([a-z_\s]+)/.*'


/usr/local/lib/python3.5/site-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
  warnings.warn("Pattern library is not installed, lemmatization won't be available.")

In [4]:
class PickledCorpus(CategorizedCorpusReader, CorpusReader):
    
    def __init__(self, root, fileids=PKL_PATTERN, cat_pattern=CAT_PATTERN):
        CategorizedCorpusReader.__init__(self, {"cat_pattern": cat_pattern})
        CorpusReader.__init__(self, root, fileids)
        
        self.punct = set(string.punctuation) | {'“', '—', '’', '”', '…'}
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.wordnet = nltk.WordNetLemmatizer() 
    
    def _resolve(self, fileids, categories):
        if fileids is not None and categories is not None:
            raise ValueError("Specify fileids or categories, not both")

        if categories is not None:
            return self.fileids(categories=categories)
        return fileids
    
    def lemmatize(self, token, tag):
        token = token.lower()
        
        if token not in self.stopwords:
            if not all(c in self.punct for c in token):
                tag =  {
                    'N': wn.NOUN,
                    'V': wn.VERB,
                    'R': wn.ADV,
                    'J': wn.ADJ
                }.get(tag[0], wn.NOUN)
                return self.wordnet.lemmatize(token, tag)
    
    def tokenize(self, doc):
        # Expects a preprocessed document, removes stopwords and punctuation
        # makes all tokens lowercase and lemmatizes them. 
        return list(filter(None, [
            self.lemmatize(token, tag)
            for paragraph in doc 
            for sentence in paragraph 
            for token, tag in sentence 
        ]))
    
    def docs(self, fileids=None, categories=None):
        # Resolve the fileids and the categories
        fileids = self._resolve(fileids, categories)

        # Create a generator, loading one document into memory at a time.
        for path, enc, fileid in self.abspaths(fileids, True, True):
            with open(path, 'rb') as f:
                yield self.tokenize(pickle.load(f))

The PickledCorpus is a Python class that reads a continuous stream of pickle files from disk. The files themselves are preprocessed documents from RSS feeds in various topics (and are actually just a small sample of the documents in the larger corpus). If you're interested in the ingestion and curation of this corpus, see baleen.districtdatalabs.com.

Just to get a feel for this data set, I'll load the corpus and print out the number of documents per category:


In [5]:
# Create the Corpus Reader
corpus = PickledCorpus(CORPUS_PATH)

In [6]:
# Count the total number of documents
total_docs = 0

# Count the number of documents per category. 
for category in corpus.categories():
    num_docs = sum(1 for doc in corpus.fileids(categories=[category]))
    total_docs += num_docs 
    
    print("{}: {:,} documents".format(category, num_docs))
    
print("\n{:,} documents in the corpus".format(total_docs))


books: 71 documents
business: 389 documents
cinema: 100 documents
cooking: 30 documents
data_science: 41 documents
design: 55 documents
do_it_yourself: 122 documents
gaming: 128 documents
news: 1,159 documents
politics: 149 documents
sports: 118 documents
tech: 176 documents

2,538 documents in the corpus

Our corpus reader object handles text preprocessing with NLTK (the Natural Language Toolkit), namely by converting each document as follows:

  • tokenizing the document
  • making all tokens lower case
  • removing stopwords and punctuation
  • converting words to their lemmas
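The lemmatization step relies on NLTK's WordNet lemmatizer, which needs a part-of-speech hint; here is a minimal standalone sketch of what it does (the example words are illustrative):

import nltk
from nltk.corpus import wordnet as wn

lemmatizer = nltk.WordNetLemmatizer()
print(lemmatizer.lemmatize("running", wn.VERB))   # run
print(lemmatizer.lemmatize("geese", wn.NOUN))     # goose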

Here is an example document:


In [7]:
fid = random.choice(corpus.fileids())
doc = next(corpus.docs(fileids=[fid]))
print(" ".join(doc))


car bomb explosion turkish capital ankara leave least 34 people dead 100 injured accord turkey health minister today attack targeted civilian bus stop interior minister efkan ala say health minister mehmet muezzinoglu say 30 victim die scene four die hospital muezzinoglu also say 125 people wound 19 serious condition united state condemn attack take innocent life injured score national security council spokesman ned price say thought prayer go kill injure well love one price statement say horrific act recent many terrorist attack perpetrate turkish people united state stand together turkey nato ally value partner confront scourge terrorism explosion occur city main boulevard ataturk bulvari near city main square kizilay associated press report two day ago u embassy say potential terrorist plot attack turkish government building housing locate bahcelievler area ankara u embassy say american avoid area immediately clear whether u embassy warning relate attack associated press contribute report

The next step is to convert these documents into vectors so that we can apply machine learning. We'll use a bag-of-words (bow) model with TF-IDF, implemented by the Gensim library.


In [8]:
# Create the lexicon from the corpus 
lexicon = gensim.corpora.Dictionary(corpus.docs())

# Create the document vectors 
docvecs = [lexicon.doc2bow(doc) for doc in corpus.docs()]

# Train the TF-IDF model and convert vectors to TF-IDF
tfidf = gensim.models.TfidfModel(docvecs, id2word=lexicon, normalize=True)
tfidfvecs = [tfidf[doc] for doc in docvecs]

# Save the lexicon and TF-IDF model to disk.
lexicon.save('data/topics/lexicon.dat')
tfidf.save('data/topics/tfidf_model.pkl')

Documents are now described by the words that are most important to that document relative to the rest of the corpus. The document above has been transformed into the following vector with associated weights:


In [9]:
# Convert the random document from above into a TF-IDF vector 
dv = tfidf[lexicon.doc2bow(doc)]

# Print the document terms and their weights. 
print(" ".join([
    "{} ({:0.2f})".format(lexicon[tid], score)
    for tid, score in sorted(dv, key=itemgetter(1), reverse=True)
]))


embassy (0.29) muezzinoglu (0.27) turkish (0.26) attack (0.22) ankara (0.20) injured (0.18) minister (0.17) explosion (0.16) bahcelievler (0.15) turkey (0.15) efkan (0.14) bulvari (0.14) kizilay (0.14) ataturk (0.14) terrorist (0.14) scourge (0.13) perpetrate (0.13) ned (0.12) targeted (0.12) mehmet (0.12) associated (0.11) main (0.11) boulevard (0.11) health (0.11) ala (0.11) nato (0.10) horrific (0.10) price (0.10) 125 (0.10) die (0.10) prayer (0.10) innocent (0.09) united (0.09) interior (0.09) area (0.09) condemn (0.08) confront (0.08) press (0.08) civilian (0.08) housing (0.08) wound (0.08) terrorism (0.08) plot (0.08) bus (0.08) warning (0.08) injure (0.07) city (0.07) council (0.07) locate (0.07) 34 (0.07) bomb (0.07) ally (0.07) hospital (0.07) square (0.07) occur (0.06) thought (0.06) say (0.06) score (0.06) victim (0.06) dead (0.06) relate (0.06) condition (0.06) spokesman (0.06) 19 (0.06) u (0.06) avoid (0.06) building (0.06) serious (0.06) people (0.06) scene (0.06) partner (0.06) value (0.05) immediately (0.05) potential (0.05) report (0.05) capital (0.05) car (0.05) contribute (0.05) 100 (0.05) act (0.05) near (0.05) state (0.05) together (0.05) security (0.05) kill (0.05) stand (0.04) love (0.04) stop (0.04) clear (0.04) 30 (0.04) ago (0.04) recent (0.04) statement (0.04) national (0.04) whether (0.04) least (0.04) today (0.04) american (0.04) government (0.04) four (0.03) life (0.03) leave (0.03) accord (0.03) many (0.02) well (0.02) day (0.02) two (0.02) go (0.02) take (0.02) also (0.01) one (0.01)

Topic Visualization with LDA

We have a lot of documents in our corpus, so let's see if we can cluster them into related topics using the Latent Dirichlet Allocation (LDA) model that comes with Gensim. This model is widely used for "topic modeling" -- that is, clustering documents by theme.


In [10]:
# Select the number of topics to train the model on.
NUM_TOPICS = 10 

# Create the LDA model from the docvecs corpus and save to disk.
model = gensim.models.LdaModel(docvecs, id2word=lexicon, alpha='auto', num_topics=NUM_TOPICS)
model.save('data/topics/lda_model.pkl')

Each topic is represented as a vector in which each word is a dimension and the value is the probability of that word belonging to the topic. We can use the model to query the topics for a document; our random document from above is assigned the following topics with associated probabilities:


In [11]:
model[lexicon.doc2bow(doc)]


Out[11]:
[(2, 0.72882756700044149), (8, 0.2632769507616482)]

We can assign the most probable topic to each document in our corpus by selecting the topic with the maximal probability:


In [12]:
topics = [
    max(model[doc], key=itemgetter(1))[0]
    for doc in docvecs
]

Topics themselves can be described by their highest probability words:


In [13]:
for tid, topic in model.print_topics():
    print("Topic {}:\n{}\n".format(tid, topic))


Topic 0:
0.010*"game" + 0.007*"say" + 0.006*"team" + 0.005*"get" + 0.005*"one" + 0.005*"season" + 0.005*"go" + 0.005*"first" + 0.005*"make" + 0.005*"new"

Topic 1:
0.007*"data" + 0.006*"say" + 0.004*"one" + 0.004*"use" + 0.004*"also" + 0.003*"make" + 0.003*"like" + 0.003*"people" + 0.003*"new" + 0.003*"find"

Topic 2:
0.009*"say" + 0.006*"year" + 0.005*"one" + 0.004*"people" + 0.004*"state" + 0.004*"two" + 0.003*"eng" + 0.003*"also" + 0.003*"time" + 0.003*"get"

Topic 3:
0.011*"say" + 0.008*"year" + 0.004*"state" + 0.003*"take" + 0.003*"also" + 0.003*"make" + 0.003*"time" + 0.003*"would" + 0.003*"go" + 0.003*"new"

Topic 4:
0.014*"trump" + 0.012*"say" + 0.005*"republican" + 0.005*"one" + 0.005*"get" + 0.004*"go" + 0.004*"like" + 0.004*"clinton" + 0.004*"make" + 0.004*"state"

Topic 5:
0.006*"one" + 0.005*"make" + 0.005*"may" + 0.004*"time" + 0.004*"say" + 0.004*"get" + 0.004*"1" + 0.003*"like" + 0.003*"take" + 0.003*"two"

Topic 6:
0.011*"say" + 0.006*"trump" + 0.005*"new" + 0.005*"year" + 0.004*"make" + 0.004*"get" + 0.004*"state" + 0.003*"one" + 0.003*"would" + 0.003*"time"

Topic 7:
0.015*"say" + 0.007*"year" + 0.005*"mr" + 0.004*"state" + 0.004*"also" + 0.004*"one" + 0.004*"go" + 0.004*"make" + 0.004*"people" + 0.003*"would"

Topic 8:
0.012*"say" + 0.006*"year" + 0.005*"one" + 0.004*"make" + 0.004*"would" + 0.004*"u" + 0.004*"get" + 0.004*"company" + 0.004*"new" + 0.004*"time"

Topic 9:
0.009*"say" + 0.004*"make" + 0.004*"one" + 0.004*"year" + 0.004*"like" + 0.004*"would" + 0.004*"new" + 0.003*"company" + 0.003*"use" + 0.003*"people"

We can plot each topic by using decomposition methods (TruncatedSVD in this case) to reduce the probability vector for each topic to 2 dimensions, then size the radius of each topic according to how much probability mass the documents in the corpus assign to it. You can also try this with PCA, which is explored below.


In [14]:
# Create a sum dictionary that adds up, for each topic, the total
# probability contributed to it by every document in the corpus.
tsize = defaultdict(float)
for doc in docvecs:
    for tid, prob in model[doc]:
        tsize[tid] += prob

In [15]:
# Create a numpy array of topic vectors where each vector 
# is the topic probability of all terms in the lexicon. 
tvecs = np.array([
    sparse2full(model.get_topic_terms(tid, len(lexicon)), len(lexicon)) 
    for tid in range(NUM_TOPICS)
])

In [16]:
# Import the model family 
from sklearn.decomposition import TruncatedSVD 

# Instantiate the model form, fit and transform 
topic_svd = TruncatedSVD(n_components=2)
svd_tvecs = topic_svd.fit_transform(tvecs)

In [17]:
# Create the Bokeh columnar data source with our various elements. 
# Note the rescaling/normalization of the topic sizes so the radius of our
# topic circles fits in the graph a bit better. 
tsource = ColumnDataSource(
        data=dict(
            x=svd_tvecs[:, 0],
            y=svd_tvecs[:, 1],
            w=[model.print_topic(tid, 10) for tid in range(10)],
            c=brewer['Spectral'][10],
            r=[tsize[idx]/700000.0 for idx in range(10)],
        )
    )

# Create the hover tool so that we can visualize the topics. 
hover = HoverTool(
        tooltips=[
            ("Words", "@w"),
        ]
    )


# Create the figure to draw the graph on. 
plt = figure(
    title="Topic Model Decomposition", 
    width=960, height=540, 
    tools="pan,box_zoom,reset,resize,save"
)

# Add the hover tool 
plt.add_tools(hover)

# Plot the SVD topic dimensions as a scatter plot 
plt.scatter(
    'x', 'y', source=tsource, size=9,
    radius='r', line_color='c', fill_color='c',
    marker='circle', fill_alpha=0.85,
)

# Show the plot to render the JavaScript 
show(plt)


Corpus Visualization with PCA

The bag of words model means that every token (string representation of a word) is a dimension and a document is represented by a vector that maps the relative weight of that dimension to the document by the TF-IDF metric. In order to visualize documents in this high dimensional space, we must use decomposition methods to reduce the dimensionality to something we can plot.

One good first attempt is to use principal component analysis (PCA) to reduce the data set dimensions (the number of vocabulary words in the corpus) to 2 dimensions in order to map the corpus as a scatter plot.

We'll use the Scikit-Learn PCA transformer to do this work:


In [18]:
# In order to use Scikit-Learn we need to transform Gensim vectors into a numpy Matrix. 
docarr = np.array([sparse2full(vec, len(lexicon)) for vec in tfidfvecs])

In [19]:
# Import the model family 
from sklearn.decomposition import PCA 

# Instantiate the model form, fit and transform 
tfidf_pca = PCA(n_components=2)
pca_dvecs = tfidf_pca.fit_transform(docarr)

We can now use Bokeh to create an interactive plot that will allow us to explore documents according to their position in decomposed TF-IDF space, coloring by their topic.


In [20]:
# Create a map using the ColorBrewer 'Paired' Palette to assign 
# Topic IDs to specific colors. 
cmap = {
    i: brewer['Paired'][10][i]
    for i in range(10)
}

# Create a tokens listing for our hover tool. 
tokens = [
    " ".join([
        lexicon[tid] for tid, _ in sorted(doc, key=itemgetter(1), reverse=True)
    ][:10])
    for doc in tfidfvecs
]

# Create a Bokeh tabular data source to describe the data we've created. 
source = ColumnDataSource(
        data=dict(
            x=pca_dvecs[:, 0],
            y=pca_dvecs[:, 1],
            w=tokens,
            t=topics,
            c=[cmap[t] for t in topics],
        )
    )

# Create an interactive hover tool so that we can see the document. 
hover = HoverTool(
        tooltips=[
            ("Words", "@w"),
            ("Topic", "@t"),
        ]
    )

# Create the figure to draw the graph on. 
plt = figure(
    title="PCA Decomposition of BoW Space", 
    width=960, height=540, 
    tools="pan,box_zoom,reset,resize,save"
)

# Add the hover tool to the figure 
plt.add_tools(hover)

# Create the scatter plot with the PCA dimensions as the points. 
plt.scatter(
    'x', 'y', source=source, size=9,
    marker='circle_x', line_color='c', 
    fill_color='c', fill_alpha=0.5,
)

# Show the plot to render the JavaScript 
show(plt)


Another approach is to use the TSNE model (t-distributed Stochastic Neighbor Embedding) from the manifold package. This is a very popular visualization/projection mechanism for text clustering.


In [25]:
# Import the TSNE model family from the manifold package 
from sklearn.manifold import TSNE 
from sklearn.pipeline import Pipeline

# Instantiate the model form. It is usually recommended
# to apply PCA (for dense data) or TruncatedSVD (for sparse)
# before TSNE to reduce noise and improve performance. 
tsne = Pipeline([
    ('svd', TruncatedSVD(n_components=75)),
    ('tsne', TSNE(n_components=2)),
])
                     
# Transform our TF-IDF vectors.
tsne_dvecs = tsne.fit_transform(docarr)

In [26]:
# Create a map using the ColorBrewer 'Paired' Palette to assign 
# Topic IDs to specific colors. 
cmap = {
    i: brewer['Paired'][10][i]
    for i in range(10)
}

# Create a tokens listing for our hover tool. 
tokens = [
    " ".join([
        lexicon[tid] for tid, _ in sorted(doc, key=itemgetter(1), reverse=True)
    ][:10])
    for doc in tfidfvecs
]

# Create a Bokeh tabular data source to describe the data we've created. 
source = ColumnDataSource(
        data=dict(
            x=tsne_dvecs[:, 0],
            y=tsne_dvecs[:, 1],
            w=tokens,
            t=topics,
            c=[cmap[t] for t in topics],
        )
    )

# Create an interactive hover tool so that we can see the document. 
hover = HoverTool(
        tooltips=[
            ("Words", "@w"),
            ("Topic", "@t"),
        ]
    )

# Create the figure to draw the graph on. 
plt = figure(
    title="TSNE Decomposition of BoW Space", 
    width=960, height=540, 
    tools="pan,box_zoom,reset,resize,save"
)

# Add the hover tool to the figure 
plt.add_tools(hover)

# Create the scatter plot with the TSNE dimensions as the points. 
plt.scatter(
    'x', 'y', source=source, size=9,
    marker='circle_x', line_color='c', 
    fill_color='c', fill_alpha=0.5,
)

# Show the plot to render the JavaScript 
show(plt)